72 research outputs found

    Visualizing Gene Clusters using Neighborhood Graphs in R

    The visualization of cluster solutions in gene expression data analysis gives practitioners an understanding of the cluster structure of their data and makes it easier to interpret the cluster results. Neighborhood graphs allow for visual assessment of relationships between adjacent clusters. For biological reasons, the number of clusters in gene expression data is rather large. Because a linear projection of the data into two dimensions does not scale well with the number of clusters, new visualization techniques using a non-linear arrangement of the clusters are needed. The new visualization tool is implemented in the open source statistical computing environment R. It is demonstrated on microarray data from yeast.

    Exploratory and inferential analysis of gene cluster neighborhood graphs

    Many different cluster methods are frequently used in gene expression data analysis to find groups of co-expressed genes. However, cluster algorithms with the ability to visualize the resulting clusters are usually preferred. The visualization of gene clusters gives practitioners an understanding of the cluster structure of their data and makes it easier to interpret the cluster results. In this paper, recent extensions of the R package gcExplorer are presented. gcExplorer is an interactive visualization toolbox for the investigation of the overall cluster structure as well as single clusters. The different visualization options, including arbitrary node and panel functions, are described in detail. Finally, the toolbox can be used to investigate the quality of a given clustering graphically as well as theoretically by testing the association between a partition and a functional group under study. It is shown that gcExplorer is a very helpful tool for a general exploration of microarray experiments. The identification of potentially interesting gene candidates or functional groups is substantially accelerated and eased. Inferential analysis on a cluster solution is used to judge its ability to provide insight into the underlying mechanistic biology of the experiment.

    Interactive visualization of clusters in microarray data: an efficient tool for improved metabolic analysis of E. coli

    Background: Interpretation of comprehensive DNA microarray data sets is a challenging task for biologists and process engineers where scientific assistance of statistics and bioinformatics is essential. Interdisciplinary cooperation and concerted development of software tools for simplified and accelerated data analysis and interpretation is the key to overcoming the bottleneck in data-analysis workflows. This approach is exemplified by gcExplorer, an interactive visualization toolbox based on cluster analysis. Clustering is an important tool in gene expression data analysis to find groups of co-expressed genes which can finally suggest functional pathways and interactions between genes. The visualization of gene clusters gives practitioners an understanding of the cluster structure of their data and makes it easier to interpret the cluster results. Results: In this study the interactive visualization toolbox gcExplorer is applied to the interpretation of E. coli microarray data. The data sets derive from two fed-batch experiments conducted in order to investigate the impact of different induction strategies on the host metabolism and product yield. The software enables direct graphical comparison of these two experiments. The identification of potentially interesting gene candidates or functional groups is substantially accelerated and eased. Conclusion: It was shown that gcExplorer is a very helpful tool to gain a general overview of microarray experiments. Interesting gene expression patterns can easily be found, compared among different experiments and combined with information about gene function from publicly available databases.

    Mixtures of Regression Models for Time-Course Gene Expression Data: Evaluation of Initialization and Random Effects

    Finite mixture models are routinely applied to time-course microarray data. Due to the complexity and size of this type of data, the choice of good starting values plays an important role. So far, initialization strategies have only been investigated for data from a mixture of multivariate normal distributions. In this work, several initialization procedures are evaluated for mixtures of regression models with and without random effects in an extensive simulation study on different artificial datasets. Finally, these procedures are also applied to a real dataset from E. coli.

    Prediction of the performance of pre-packed purification columns through machine learning

    Pre‐packed columns have been increasingly used in process development and biomanufacturing thanks to their ease of use and consistency. Traditionally, packing quality is predicted through rate models, which require extensive calibration efforts through independent experiments to determine relevant mass transfer and kinetic rate constants. Here we propose machine learning as a complementary predictive tool for column performance. A machine learning algorithm, extreme gradient boosting, was applied to a large data set of packing quality (plate height and asymmetry) for pre‐packed columns as a function of quantitative parameters (column length, column diameter, and particle size) and qualitative attributes (backbone and functional mode). The machine learning model offered excellent predictive capabilities for the plate height and the asymmetry (90 and 93%, respectively), with packing quality strongly influenced by backbone (∼70% relative importance) and functional mode (∼15% relative importance), well above all other quantitative column parameters. The results highlight the ability of machine learning to provide reliable predictions of column performance from simple, generic parameters, including strategic qualitative parameters such as backbone and functionality, usually excluded from quantitative considerations. Our results will guide further efforts in column optimization, for example, by focusing on improvements of backbone and functional mode to obtain optimized packings

    The nucleotide composition of microsatellites impacts both replication fidelity and mismatch repair in human colorectal cells

    Microsatellite instability is a key mechanism of colon carcinogenesis. We have previously studied mutations within a (CA)13 microsatellite using an enhanced green fluorescent protein (EGFP)-based reporter assay that allows the distinction of replication errors and mismatch repair (MMR) activity. Here we utilize this assay to compare mutations of mono- and dinucleotide repeats in human colorectal cells. HCT116 and HCT116+chr3 cells were stably transfected with EGFP-based plasmids harboring A10, G10, G16, (CA)13 and (CA)26 repeats. EGFP-positive mutant fractions were quantitated by flow cytometry, mutation rates were calculated and the mutant spectrum was analyzed by cycle sequencing. EGFP fluorescence pattern changed with the microsatellite's nucleotide sequence and cell type and clonal variations were observed in mononucleotide repeats. Replication errors (as calculated in HCT116) at A10 repeats were 5–10-fold higher than in G10, G16 were 30-fold higher than G10 and (CA)26 were 10-fold higher than (CA)13. The mutation rates in hMLH1-proficient HCT116+chr3 were 30–230-fold lower than in HCT116. MMR was more efficient in G16 than in A10 clones leading to a higher stability of poly-G tracts. Mutation spectra revealed predominantly 1-unit deletions in A10, (CA)13 and G10 and 2-unit deletions or 1-unit insertion in (CA)26. These findings indicate that both replication fidelity and MMR are affected by the microsatellite's nucleotide composition

    Using neighborhood graphs for the investigation of E. coli gene clusters

    Clustering is commonly used in the analysis of gene expression data to find groups of co-expressed genes. The definition of gene clusters is not very clear as genetic interactions are extremely complex. For this reason the relationship between clusters is very important as co-expressed genes can end up in different clusters. The neighborhood graph is a useful tool to visualize the cluster structure. In this paper the R package gcExplorer is presented, which is an interactive toolbox for the exploration of gene clusters. Additional information about the gene clusters, like the annotation of genes to functional groups (e.g., GO categories), can easily be investigated. The new visualization toolbox is demonstrated on microarray data from E. coli.

    R package gcExplorer: graphical and inferential exploration of cluster solutions

    Cluster analysis is commonly applied to microarray data in order to find groups of co-expressed genes, where cluster algorithms with the ability to visualize the resulting cluster objects (e.g., a dendrogram for hierarchical clustering) are usually preferred. The display of cluster solutions, particularly for a large number of clusters, is very important in exploratory data analysis. It gives practitioners an idea of the relationships between segments of a partition and makes it possible to interpret the cluster results. Neighborhood graphs (Leisch, 2006) can be used for visual assessment of the cluster structure of centroid-based cluster solutions. In a neighborhood graph each node represents a cluster, and two nodes are connected if there exist data points that have the two corresponding centroids as closest and second closest centroid. In this work we present new visualization methods based on the neighborhood graph. For node representation, different plot symbols visualizing single clusters are used, allowing a quick overview of the data. On the one hand, the corresponding data points themselves can be visualized using, for example, line diagrams for gene expression over time. On the other hand, node symbols like pie charts can be used to visualize further properties of the clusters, like association to functional groups under study. Finally, the neighborhood graph can be used for the validation of a cluster solution, e.g., by testing the relationship between a clustering and a priori information about gene functions. All visualization methods and test procedures used are implemented in the R package gcExplorer (Scharl and Leisch, 2009), which is now available on CRAN. The grid-based node symbols are implemented in the R package symbols (http://r-forge.r-project.org/projects/symbols/).
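
    The edge rule stated in this abstract (two clusters are connected when some data point has their centroids as closest and second closest) can be illustrated with a minimal sketch. It is written in plain Python rather than R, assumes Euclidean distance, and omits any edge weighting the published method may use; the function name `neighborhood_graph` and the toy data are hypothetical and are not part of gcExplorer.

```python
from math import dist


def neighborhood_graph(points, centroids):
    """Return undirected edges (i, j) with i < j between centroid indices.

    An edge is added when at least one point has centroid i and centroid j
    as its closest and second closest centroids (in either order).
    Assumes Euclidean distance and at least two centroids.
    """
    if len(centroids) < 2:
        raise ValueError("need at least two centroids")
    edges = set()
    for p in points:
        # Rank centroid indices by distance to this point.
        ranked = sorted(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
        first, second = ranked[0], ranked[1]
        edges.add((min(first, second), max(first, second)))
    return edges


# Toy data (made up for illustration): three centroids on a line.
centroids = [(0.0, 0.0), (4.0, 0.0), (8.0, 0.0)]
points = [(0.5, 0.1), (1.8, -0.2), (3.9, 0.3), (6.5, 0.0), (8.2, 0.1)]
edges = neighborhood_graph(points, centroids)
print(sorted(edges))
```

    In this toy example the two outer clusters are not connected, because no point has both outer centroids as its two nearest.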

    Analysis of gene expression time-course data using cluster techniques

    This thesis is concerned with different aspects of the analysis of gene expression time-course data using cluster techniques. The interpretation of enormous amounts of data from microarrays has been a challenging task in statistics and bioinformatics for the past few years. Time-course microarray experiments make it possible to look at the gene expression of thousands of genes at several time points simultaneously. Genes with similar expression patterns are likely to be co-regulated. Hence clustering gene expression patterns may help to find groups of co-regulated genes or to identify common temporal or spatial expression patterns. Finally, cluster results can suggest functional pathways and interactions between genes. The cluster methods investigated in this thesis include partitioning cluster methods like the well-known K-Means or the quality-based cluster algorithm Stochastic QT-Clust as well as model-based clustering. Clustering is either carried out on the raw data or on functional data. In functional data analysis a curve is fit to each observation in order to account for time dependency. In simulation studies on artificial and real data sets from publicly available databases, the properties of different cluster methods are compared and evaluated using the adjusted Rand index, the sum of within-cluster distances, and the likelihood criterion. Additionally, test procedures are developed that make it possible to judge the biological relevance of cluster solutions. All cluster algorithms and evaluation procedures are implemented in the statistical computing environment R, and all simulations are performed in R. An essential part of this thesis deals with the visualization of cluster solutions. The definition of gene clusters is not very clear as genetic interactions are extremely complex. For this reason the relationships between clusters are very important, as co-expressed genes can end up in different clusters. The visualization of cluster solutions helps to get an understanding of the cluster structure of the data and makes it easier to interpret the cluster results. Neighborhood graphs allow for visual assessment of relationships between adjacent clusters. A new visualization toolbox for the interactive exploration of cluster solutions is implemented in the R package gcExplorer. The functionality of the package includes the visualization of the cluster structure in the form of neighborhood graphs, the display of gene clusters in graphics or HTML tables, the highlighting of additional properties of the clusters, and test procedures to judge the quality of cluster solutions. Finally, the methods are applied to E. coli data sets from the Department of Biotechnology at the University of Natural Resources and Applied Life Sciences in Vienna.
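
    The adjusted Rand index named in this abstract compares two partitions of the same items while correcting for chance agreement. The sketch below uses the common Hubert and Arabie form, ARI = (index - expected) / (maximum - expected), computed from pair counts. It is written in plain Python for illustration and is not the thesis's R implementation; the function name and the example labels are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index of two labelings of the same items."""
    n = len(labels_a)
    if n != len(labels_b):
        raise ValueError("labelings must have the same length")
    if n < 2:
        raise ValueError("need at least two items")
    # Pairs of items placed together in both partitions.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs of items placed together within each partition separately.
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / comb(n, 2)
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        # Degenerate case, e.g. both partitions are a single cluster.
        return 1.0
    return (together_both - expected) / (maximum - expected)


# Same partition under different label names: perfect agreement.
same = adjusted_rand_index([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2])
# Partitions that never group the same pair together: below chance level.
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(same, crossed)
```

    The index depends only on which items are grouped together, so renaming the cluster labels does not change it. A value of 1 means identical partitions, and values near 0 mean agreement no better than chance.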